Skip to content

CCM-16073 - Enhanced callbacks#145

Draft
rhyscoxnhs wants to merge 2 commits intomainfrom
feature/CCM-16073
Draft

CCM-16073 - Enhanced callbacks#145
rhyscoxnhs wants to merge 2 commits intomainfrom
feature/CCM-16073

Conversation

@rhyscoxnhs
Copy link
Copy Markdown
Contributor

Description

Context

Type of changes

  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would change existing functionality)
  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I am familiar with the contributing guidelines
  • I have followed the code style of the project
  • I have added tests to cover my changes
  • I have updated the documentation accordingly
  • This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others privacy, we kindly ask you to NOT including PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that do contain any sensitive information. We really appreciate your cooperation in this matter.

  • I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

@rhyscoxnhs rhyscoxnhs requested review from a team as code owners April 15, 2026 10:34
@rhyscoxnhs rhyscoxnhs changed the title CCM-16073 - Enhanced callbacks [skip ci] CCM-16073 - Enhanced callbacks Apr 15, 2026
@rhyscoxnhs rhyscoxnhs requested a review from Copilot April 15, 2026 10:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces enhanced callback delivery security and operational controls by adding mTLS + certificate pinning configuration to callback targets, and shifting delivery from direct EventBridge API Destinations to an SQS + per-client HTTPS delivery Lambda model (with Redis-backed gating for rate limiting/circuit breaking).

Changes:

  • Add mtls, certPinning, and delivery fields to callback target model + schema validation, and update fixtures/tests accordingly.
  • Add CLI commands to manage mTLS, certificate pinning enable/disable, and SPKI hash extraction/storage for targets.
  • Add new https-client-lambda (delivery, signing, retries, DLQ handling, Redis/Lua gate) and new shared config-cache package; update Terraform to provision per-client delivery infra (SQS/Lambda/ElastiCache) and mock mTLS ALB.

Reviewed changes

Copilot reviewed 104 out of 107 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-pinning.ts New CLI command to enable/disable certificate pinning for a target.
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-mtls.ts New CLI command to enable/disable mTLS for a target.
tools/client-subscriptions-management/src/entrypoint/cli/targets-set-certificate.ts New CLI command to extract/store SPKI hash from PEM for a target.
tools/client-subscriptions-management/src/entrypoint/cli/clients-put.ts Minor cleanup in CLI file handling comment.
tools/client-subscriptions-management/src/domain/client-subscription-builder.ts Adds default mtls/certPinning and emits security warnings when building targets.
tools/client-subscriptions-management/src/tests/helpers/client-subscription-fixtures.ts Updates test target fixtures with required mtls/certPinning fields.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-pinning.test.ts Unit tests for targets-set-pinning CLI behavior.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-mtls.test.ts Unit tests for targets-set-mtls CLI behavior.
tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-certificate.test.ts Unit tests for targets-set-certificate CLI behavior.
tools/client-subscriptions-management/src/tests/domain/client-subscription-builder.test.ts Adds tests around security warning emission in target builder.
tools/client-subscriptions-management/package.json Adds picocolors dependency for warning output.
tests/integration/helpers/mock-client-config.ts Adds a new integration fixture key for an mTLS-enabled mock client.
tests/integration/helpers/event-factories.ts Adds factory for delivery messages compatible with the new delivery flow.
tests/integration/fixtures/subscriptions/mock-client-rate-limit.json New integration fixture including required mtls/certPinning.
tests/integration/fixtures/subscriptions/mock-client-mtls.json New integration fixture for mTLS + pinning enabled target.
tests/integration/fixtures/subscriptions/mock-client-circuit-breaker.json New integration fixture including delivery circuit breaker configuration.
tests/integration/fixtures/subscriptions/mock-client-2.json Updates existing integration fixture targets to include mtls/certPinning.
tests/integration/fixtures/subscriptions/mock-client-1.json Updates existing integration fixture targets to include mtls/certPinning.
src/models/src/client-config.ts Extends callback target type with mtls, certPinning, and optional delivery.
src/models/src/client-config-schema.ts Adds Zod validation for mtls, certPinning (incl. SPKI hash constraints), and delivery.
src/models/src/tests/client-config-schema.test.ts Adds/updates schema tests for new target fields and constraints.
src/config-cache/tsconfig.json New package tsconfig for config-cache workspace.
src/config-cache/src/index.ts Exports ConfigCache from new workspace package.
src/config-cache/src/config-cache.ts Implements TTL-based in-memory config cache.
src/config-cache/src/tests/config-cache.test.ts Unit tests for ConfigCache TTL and behaviors.
src/config-cache/package.json New workspace package definition for config-cache.
src/config-cache/jest.config.ts Jest config for config-cache package.
scripts/config/pre-commit.yaml Adjusts pre-commit detect-private-key exclusions for a test file.
pnpm-workspace.yaml Adds workspace catalog entries for @redis/client, picocolors, and Secrets Manager SDK.
lambdas/mock-webhook-lambda/src/index.ts Adds ALB mTLS client-cert header validation for mock webhook endpoint.
lambdas/mock-webhook-lambda/src/tests/index.test.ts Adds unit tests for ALB mTLS certificate verification flow.
lambdas/https-client-lambda/tsconfig.json New lambda workspace tsconfig.
lambdas/https-client-lambda/src/services/ssm-applications-map.ts Loads and caches clientId→applicationId map from SSM.
lambdas/https-client-lambda/src/services/sqs-visibility.ts SQS ChangeMessageVisibility helper.
lambdas/https-client-lambda/src/services/record-result.lua Redis Lua script to update circuit breaker state.
lambdas/https-client-lambda/src/services/payload-signer.ts Adjusts payload signer function signature (parameter order).
lambdas/https-client-lambda/src/services/logger.ts Re-exports shared logger for lambda local imports.
lambdas/https-client-lambda/src/services/endpoint-gate.ts Redis admission + circuit-breaker integration (EVALSHA/EVAL Lua execution).
lambdas/https-client-lambda/src/services/dlq-sender.ts DLQ send helper.
lambdas/https-client-lambda/src/services/delivery/tls-agent-factory.ts Builds TLS agent with optional mTLS material and SPKI pinning.
lambdas/https-client-lambda/src/services/delivery/retry-policy.ts Retry/backoff and Retry-After parsing helpers.
lambdas/https-client-lambda/src/services/delivery/https-client.ts HTTPS delivery client + result classification.
lambdas/https-client-lambda/src/services/delivery-metrics.ts Embedded metrics emission for delivery and circuit-breaker events.
lambdas/https-client-lambda/src/services/config-loader.ts Loads client config/targets from S3 with TTL caching and schema validation.
lambdas/https-client-lambda/src/services/admit.lua Redis Lua script for token-bucket rate limiting + CB admission.
lambdas/https-client-lambda/src/lua.d.ts Type declaration for importing .lua as text.
lambdas/https-client-lambda/src/index.ts Lambda entrypoint delegating to record processor.
lambdas/https-client-lambda/src/handler.ts Main SQS batch handler: load config, sign, gate, deliver, retry/DLQ, metrics.
lambdas/https-client-lambda/src/tests/tls-agent-factory.test.ts Unit tests for TLS agent factory behavior (S3/SecretsManager/pinning).
lambdas/https-client-lambda/src/tests/ssm-applications-map.test.ts Unit tests for SSM applications map loading/caching/errors.
lambdas/https-client-lambda/src/tests/sqs-visibility.test.ts Unit tests for SQS visibility changes.
lambdas/https-client-lambda/src/tests/retry-policy.test.ts Unit tests for retry policy helpers.
lambdas/https-client-lambda/src/tests/payload-signer.test.ts Unit tests for payload signing.
lambdas/https-client-lambda/src/tests/index.test.ts Unit test for lambda entrypoint wiring.
lambdas/https-client-lambda/src/tests/https-client.test.ts Unit tests for delivery HTTP behavior classification.
lambdas/https-client-lambda/src/tests/handler.test.ts Unit tests for SQS record processing paths (DLQ/retry/gate/CB).
lambdas/https-client-lambda/src/tests/endpoint-gate.test.ts Unit tests for Redis Lua invocation paths and client creation.
lambdas/https-client-lambda/src/tests/dlq-sender.test.ts Unit tests for DLQ sender.
lambdas/https-client-lambda/src/tests/delivery-metrics.test.ts Unit tests for embedded metrics behavior.
lambdas/https-client-lambda/src/tests/config-loader.test.ts Unit tests for S3 config loading + TTL cache.
lambdas/https-client-lambda/package.json New lambda workspace package definition.
lambdas/https-client-lambda/lua-transform.js Jest transform to load .lua scripts as strings.
lambdas/https-client-lambda/jest.config.ts Jest config enabling .lua transform.
lambdas/client-transform-filter-lambda/src/services/observability.ts Removes callback signing observability (signing moved downstream).
lambdas/client-transform-filter-lambda/src/services/config-loader.ts Switches to shared config-cache package import.
lambdas/client-transform-filter-lambda/src/services/config-loader-service.ts Switches to shared config-cache package import.
lambdas/client-transform-filter-lambda/src/index.ts Removes ApplicationsMapService wiring from handler creation.
lambdas/client-transform-filter-lambda/src/handler.ts Removes per-target signatures from output; outputs deliverable payload + subscriptions only.
lambdas/client-transform-filter-lambda/src/tests/services/payload-signer.test.ts Removes tests for payload signing in transform-filter lambda.
lambdas/client-transform-filter-lambda/src/tests/services/config-update.component.test.ts Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/services/config-loader.test.ts Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/services/config-cache.test.ts Updates imports to new config-cache package.
lambdas/client-transform-filter-lambda/src/tests/index.test.ts Updates tests to reflect removal of signatures and apps map dependency.
lambdas/client-transform-filter-lambda/src/tests/index.component.test.ts Updates component tests to reflect new payload shape and removed SSM dependency.
lambdas/client-transform-filter-lambda/src/tests/helpers/client-subscription-fixtures.ts Updates target fixtures to include required mtls/certPinning.
lambdas/client-transform-filter-lambda/package.json Adds config-cache workspace dependency.
knip.ts Updates Knip workspace config (new workspaces and integration entrypoints).
infrastructure/terraform/modules/client-destination/variables.tf Removes legacy API Destination-based module.
infrastructure/terraform/modules/client-destination/module_target_dlq.tf Removes legacy per-target DLQ module.
infrastructure/terraform/modules/client-destination/locals.tf Removes legacy locals for old module.
infrastructure/terraform/modules/client-destination/iam_role_api_target_role.tf Removes legacy IAM role/policy for API destinations.
infrastructure/terraform/modules/client-destination/cloudwatch_event_rule_main.tf Removes legacy EventBridge rule/target to API destination setup.
infrastructure/terraform/modules/client-destination/cloudwatch_event_connection_main.tf Removes legacy EventBridge Connection resources.
infrastructure/terraform/modules/client-destination/cloudwatch_event_api_destination_this.tf Removes legacy API Destination resources.
infrastructure/terraform/modules/client-destination/README.md Removes docs for legacy module.
infrastructure/terraform/modules/client-delivery/variables.tf Adds new per-client delivery module variables.
infrastructure/terraform/modules/client-delivery/outputs.tf Adds outputs for per-client queues and lambda details.
infrastructure/terraform/modules/client-delivery/module_sqs_per_client.tf Provisions per-client delivery SQS queue.
infrastructure/terraform/modules/client-delivery/module_https_client_lambda.tf Provisions per-client https-client lambda + SQS event source mapping.
infrastructure/terraform/modules/client-delivery/module_dlq_per_client.tf Provisions per-client DLQ + DLQ depth alarm.
infrastructure/terraform/modules/client-delivery/locals.tf Defines naming/tagging locals for per-client module.
infrastructure/terraform/modules/client-delivery/iam_role_sqs_target.tf IAM policy for https-client lambda (SQS/S3/SSM/KMS and optional SecretsManager/S3 cert).
infrastructure/terraform/modules/client-delivery/cloudwatch_event_rule_per_subscription.tf Defines EventBridge rules/targets routing to per-client SQS delivery queue.
infrastructure/terraform/modules/client-delivery/README.md Adds docs for new per-client delivery module.
infrastructure/terraform/components/callbacks/variables.tf Enables X-Ray by default and adds mTLS-related component variables.
infrastructure/terraform/components/callbacks/pipes_pipe_main.tf Removes signatures from Pipe output template.
infrastructure/terraform/components/callbacks/module_mock_webhook_alb_mtls.tf Adds internal ALB + ACM import + passthrough mTLS wiring for mock webhook.
infrastructure/terraform/components/callbacks/module_client_destination.tf Removes legacy client_destination module usage.
infrastructure/terraform/components/callbacks/module_client_delivery.tf Adds per-client delivery module instantiation.
infrastructure/terraform/components/callbacks/locals.tf Reworks locals for per-client subscriptions/targets and mTLS mock endpoint selection.
infrastructure/terraform/components/callbacks/elasticache_delivery_state.tf Adds ElastiCache Serverless Redis for delivery state + security groups/alarms.
infrastructure/terraform/components/callbacks/cloudwatch_metric_alarm_dlq_depth.tf Removes legacy per-target DLQ alarm (replaced by per-client alarms).
infrastructure/terraform/components/callbacks/cloudwatch_eventbus_main.tf Adds EventBridge archive for 7-day retention.
infrastructure/terraform/components/callbacks/README.md Updates docs to reflect new modules, variables, and defaults.
eslint.config.mjs Ignores lua-transform.js and expands rule scope to include src workspaces.
.gitleaksignore Adds ignore for test private key fixture in tls-agent-factory tests.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lambdas/https-client-lambda/src/services/delivery/https-client.ts
Comment thread lambdas/https-client-lambda/src/services/delivery/https-client.ts
@mjewildnhs mjewildnhs marked this pull request as draft April 15, 2026 10:45
Copy link
Copy Markdown
Contributor

@aidenvaines-cgi aidenvaines-cgi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the TF stuff all looks cool from an rough check

@@ -0,0 +1,178 @@
resource "aws_elasticache_serverless_cache" "delivery_state" {
name = "${local.csi}-delivery-state"
engine = "redis"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valkey is redis compatible, is half the cost and faster. Can we use that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just waiting on agreement.

Comment thread infrastructure/terraform/components/callbacks/module_client_destination.tf Outdated
@@ -0,0 +1,39 @@
module "sqs_delivery" {
source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.7/terraform-sqs.zip"
source = "https://github.com/NHSDigital/nhs-notify-shared-modules/releases/download/3.0.9/terraform-sqs.zip"

3.0.9 exists, 3.0.7 is being used here. minor, ignore if you want

type = bool
description = "Enable AWS X-Ray active tracing for Lambda functions"
default = false
default = true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think weve historically had xray off by default in prod as its quite spendy, and enabling it as and when needed.

All variables should be prod defaults with PTL envs having overrides

log_destination_arn = var.log_destination_arn
log_subscription_role_arn = var.log_subscription_role_arn

lambda_env_vars = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please sort alphabetically, it makes me sad

import type { SQSRecord } from "aws-lambda";

import { processRecords } from "handler";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of mock setup at the beginning of this. Could we extract this out into a test helper or maybe a fixture so that the test file is cleaner and more readable?

firstReceivedMs: number,
maxRetryDurationMs: number,
): boolean {
return Date.now() - firstReceivedMs >= maxRetryDurationMs;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a little hard to read, could do with parens around the first operation

Comment on lines +13 to +16
const { MTLS_CERT_SECRET_ARN } = process.env;
const { MTLS_TEST_CERT_S3_BUCKET } = process.env;
const { MTLS_TEST_CERT_S3_KEY } = process.env;
const { MTLS_TEST_CA_S3_KEY } = process.env;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These destructures should be merged ideally

Comment on lines +78 to +83
// eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
const parts = pem.split(/(?=-----BEGIN )/);
// eslint-disable-next-line sonarjs/null-dereference -- .find returns string|undefined, fallback handles it
const key = parts.find((p) => p.includes("PRIVATE KEY")) ?? "";
// eslint-disable-next-line sonarjs/null-dereference -- as above
const cert = parts.find((p) => p.includes("CERTIFICATE")) ?? "";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we either disable this lint at the top or find ways to fix this lint properly.

Comment on lines +55 to +64
/* eslint-disable sonarjs/null-dereference -- refillPerSec is typed as number, cannot be null */
const args = [
now,
refillPerSec.toString(),
config.burstCapacity.toString(),
config.cbProbeIntervalMs.toString(),
cbEnabled ? "1" : "0",
config.decayPeriodMs.toString(),
];
/* eslint-enable sonarjs/null-dereference */
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we're disabling this lint?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to try it without to see if it fails?

@@ -0,0 +1,474 @@
import type { SQSRecord } from "aws-lambda";

Copy link
Copy Markdown
Contributor

@gparry333 gparry333 Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do with a test being added that tests that an unexpected error being thrown on one record doesn't prevent a subsequent record being processed e.g. An unexpected error thrown by an upstream dependency like loadTargetConfig which would be caught by the try/catch in processRecords - loadTargetConfig throwing "S3 unavailable" for record 1, while record 2 should still be processed normally.

{ type: "MessageStatus", attributes: { messageStatus: "delivered" } },
],
}) as Parameters<typeof signPayload>[2];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also have a negative test for this that verifies an empty payload produces a deterministic (non-empty) signature, ensuring the signing function doesn't silently fail on edge-case payloads?

Copy link
Copy Markdown
Contributor

@mjewildnhs mjewildnhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as reviewing the terraform

@@ -0,0 +1,178 @@
resource "aws_elasticache_serverless_cache" "delivery_state" {
name = "${local.csi}-delivery-state"
engine = "redis"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cost saving I think we should switch to valkey.
You are billed in gigabyte-hours (GB-hrs) and the minimum for redis is 1GB vs 100mb in valkey.
Not sure we'll go above 100mb even in prod.
https://aws.amazon.com/elasticache/pricing/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just waiting on agreement.

Comment thread .gitleaksignore
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the phase 2 work but I think we can delete lines 3-6 - those secrets have never been in our repo at any point


cache_usage_limits {
data_storage {
maximum = 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1GB is the minimum with redis but can go down to 100mb if we make the valkey switch.
Keeping this low for dev/test environments is good for cost saving.
We should see how much each client will take in storage.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want an alarm on storage - e.g. 80% used

certificate_arn = aws_acm_certificate_import.mock_webhook_server[0].arn

mutual_authentication {
mode = "passthrough"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With passthrough we should be able route mTLS and non mTLS traffic through and remove the aws_lambda_permission config in module_mock_webhook_lambda.
We'd then need to modify the lambda to output whether it was mTLS or not which the tests can then verify based on their expectation.
This would simplify things a lot and also get us the hardening we are looking for on the mock webhook.

MAX_RETRY_DURATION_SECONDS = tostring(var.max_retry_duration_seconds)
MTLS_CERT_SECRET_ARN = var.mtls_cert_secret_arn
MTLS_TEST_CERT_S3_BUCKET = var.mtls_test_cert_s3_bucket
MTLS_TEST_CERT_S3_KEY = var.mtls_test_cert_s3_key
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the lambda is also missing MTLS_TEST_CA_S3_KEY

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any of these are referenced?
AI seems to like adding outputs for no reason.


variable "mtls_cert_secret_arn" {
type = string
description = "Secrets Manager ARN for the mTLS client certificate (production)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my other comment I think we need to review the production vs non-production terminology we are using. Production should be assumed and non-production should be "dev" - this aligns with the tfvar groups.

Comment on lines +160 to +164
variable "deploy_mock_clients" {
type = bool
description = "Whether mock client infrastructure is deployed (non-prod only)"
default = false
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
variable "deploy_mock_clients" {
type = bool
description = "Whether mock client infrastructure is deployed (non-prod only)"
default = false
}

Not used in the module.

Copy link
Copy Markdown
Contributor

@mjewildnhs mjewildnhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done transform filter lambda changes - in middle of http lambda

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete now its in a shared module. This is just a duplicate of src/config-cache/src/__tests__/config-cache.test.ts


export const configLoaderService = new ConfigLoaderService();

export const applicationsMapService = new ApplicationsMapService();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its never very good at cleaning up after itself. Theres a bunch of dead code to remove. AI summary:

Files to delete:

File Reason
src/services/ssm-applications-map.ts Dead code — no longer imported in index.ts or handler.ts
src/__tests__/services/ssm-applications-map.test.ts Tests the deleted service

Knock-on changes required:

  1. package.json — remove @aws-sdk/client-ssm from dependencies (only used in ssm-applications-map.ts and its test)

  2. src/__tests__/index.component.test.ts — needs cleanup:

    • Remove the jest.mock("@aws-sdk/client-ssm", ...) block (lines 18–26)
    • Remove mockSsmSend variable and its usages
    • Remove APPLICATIONS_MAP_PARAMETER from process.env setup/teardown
    • Remove CLIENT_SUBSCRIPTION_CONFIG_BUCKET, CLIENT_SUBSCRIPTION_CONFIG_PREFIX, CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDS env vars — these are now the wrong names (should be CLIENT_CONFIG_BUCKET etc. per the earlier fix), and the SSM-related ones are dead
    • Remove mockSsmSend.mockResolvedValue(...) from beforeEach
    • Remove mockSsmSend.mockClear() from beforeEach

    Note: the SSM mock was used to simulate ApplicationsMapService loading the applications map — since that's gone from this lambda, it's entirely redundant. The applicationsMap const and mockSsmSend usages can all go.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to remove APPLICATIONS_MAP_PARAMETER from the terraform module for the lambda.

@@ -2,6 +2,7 @@
"dependencies": {
"@aws-sdk/client-s3": "catalog:aws",
"@aws-sdk/client-ssm": "catalog:aws",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a .luarc.json file at the root level like we had on the PoC branch - otherwise the IDE won't be happy with the undefined globals
The following works:

{
  "diagnostics": {
    "globals": [
      "KEYS",
      "ARGV",
      "redis"
    ]
  }
}

Comment on lines +97 to +99
const maxRetryDurationMs =
(target.delivery?.maxRetryDurationSeconds ??
DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol this is crazy to read

Suggested change
const maxRetryDurationMs =
(target.delivery?.maxRetryDurationSeconds ??
DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;
const maxRetryDurationMs =
target.delivery?.maxRetryDurationSeconds !== undefined
? target.delivery.maxRetryDurationSeconds * 1000
: DEFAULT_MAX_RETRY_DURATION_MS;

async function processRecord(record: SQSRecord): Promise<void> {
const { CLIENT_ID } = process.env;
if (!CLIENT_ID) {
throw new Error("CLIENT_ID is required");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is misconfigured then the behaviour will be undesirable - SQS will just keep retrying it 100 times.
However it will never happen so I don't think its worth explicitly handling it and treating the events as permanent failures (by sending to the DLQ).

if (cbEnabled) {
await recordResult(redis, targetId, true, gateConfig);
}
emitDeliverySuccess(targetId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred the observability stuff we had in the client transform filter lambda (lambdas/client-transform-filter-lambda/src/services/observability.ts) which wrapped the logging and metric calls into one. It cuts down on the clutter a bit by reducing it into 1 call.

{
hostname: url.hostname,
port: url.port || 443,
path: url.pathname + url.search,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may face URL encoding issues here. https.request does not perform URL escaping, so spaces may result in malformed HTTPS requests.

Instead, since https.request supports a URL object for path, we should pass through the url object we already construct before making the request.

const SQS_MAX_VISIBILITY_SECONDS = 43_200;

export function jitteredBackoffSeconds(receiveCount: number): number {
const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have proper constant naming here for 5, 2, etc so it's understandable at a high level

MTLS_TEST_CERT_S3_KEY,
);
// eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
const parts = pem.split(/(?=-----BEGIN )/);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make use of a proper npm package for pem file parsing, that way we're not hand rolling our own parsing code which could result in security vulnerabilities from malformed data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - there must be something off the shelf somewhere for this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use secrets manager for everything we won't have this problem.

let cache: ConfigCache | undefined;

function getCache(): ConfigCache {
if (!cache) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified like so:

cache ??= new ConfigCache((Number(process.env.CLIENT_SUBSCRIPTION_CACHE_TTL_SECONDS) || 300) * 1000);

@@ -0,0 +1 @@
export * from "@nhs-notify-client-callbacks/logger";
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be re-exporting like this, we should instead directly use the @nhs-notify-client-callbacks/logger import.

const target = await loadTargetConfig(CLIENT_ID, targetId);
const maxRetryDurationMs =
(target.delivery?.maxRetryDurationSeconds ??
DEFAULT_MAX_RETRY_DURATION_MS / 1000) * 1000;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it we're dividing by 1000 just to then multiply by 1000 later? It may make more sense to make the default be in seconds to simplify all of this.

return;
}

if ("retryAfterHeader" in result) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing an in check here? Can we strongly type result instead to avoid this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the discriminated union type ok and permanent are both false so it doesn't know if its a rate limited result or other status code result.
I've suggested an alternative type which is easier to understand and uses a "type" as a discriminator.

Copy link
Copy Markdown
Contributor

@mjewildnhs mjewildnhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Middle of http client lambda review - done with the handler, client, retry policy

): Promise<SQSBatchItemFailure[]> {
const failures: SQSBatchItemFailure[] = [];

for (const record of records) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try and processRecords in parallel up to a configurable concurrency limit.

throw new Error("Rate limited — requeue");
}

async function processRecord(record: SQSRecord): Promise<void> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is too long and needs refactoring as its hard to follow. Especially with the early returns if result.ok == true.
Can it be refactored so we broadly have (with other bits between):

checkAdmission()
deliverPayload()
handleDeliveryResult()

targetId: string;
};

async function handleRateLimited(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of this should be refactored to lambdas/https-client-lambda/src/services/delivery/retry-policy.ts we then wouldn't need to export/import as much from retry-policy and keep retry specific calculations out of the root handler.

}

if ("retryAfterHeader" in result) {
await handleRateLimited(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like any metrics are recorded when we get a 429.
Feels like we should record this.
Are there any other important gaps in the flow where metrics aren't recorded?

return cachedMaterial;
}

const PERMANENT_TLS_ERROR_CODES = new Set([
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a nit-pick but this needs to be somewhere near the top of the file.
And shouldn't ERR_CERT_PINNING_FAILED be included in this list?

ELASTICACHE_IAM_USERNAME = var.elasticache_iam_username
}

vpc_config = var.lambda_security_group_id != "" ? {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the security group defined?
Am I missing something?
We need to consider what the lambda can talk to and on what ports - the lambda supports a port being in the webhook URL.

return new Promise((resolve) => {
const url = new URL(target.invocationEndpoint);

const req = https.request(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like we've any kind of timeout handling in here which I think we need. We don't want to wait for a 5m socket timeout or something.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to run the risk of the lambda timing out on request 1000/1000

export function jitteredBackoffSeconds(receiveCount: number): number {
const ceiling = Math.min(5 * 2 ** (receiveCount - 1), BACKOFF_CAP_SECONDS);
// eslint-disable-next-line sonarjs/pseudo-random -- jitter for backoff, not security-sensitive
return Math.floor(Math.random() * ceiling);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to potentially have 0 seconds retry here or do we want a minimum of 1 second?

Suggested change
return Math.floor(Math.random() * ceiling);
return Math.max(1, Math.floor(Math.random() * ceiling))

Comment on lines +17 to +18
const CERT_EXPIRY_THRESHOLD_MS =
Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const CERT_EXPIRY_THRESHOLD_MS =
Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000;
const CERT_EXPIRY_THRESHOLD_MS =
Number(process.env.CERT_EXPIRY_THRESHOLD_MS) || 86_400_000; // 24 hours

MTLS_TEST_CERT_S3_KEY,
);
// eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
const parts = pem.split(/(?=-----BEGIN )/);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - there must be something off the shelf somewhere for this.

}

async function loadCertMaterial(): Promise<CertMaterial> {
const isProduction = Boolean(MTLS_CERT_SECRET_ARN);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should revisit the decision to have the difference in cert handling between environments.
It adds a lot of complexity and unless we always test for it in non-prod there's a big danger we won't realize its broken until we've released it.
I seem to remember it was because were were concerned about costs for ephemeral environments.
Its priced per secret per month (at $0.40) - doesn't that mean we'll only be billed for a fraction of the based on how long the env was deployed?
https://aws.amazon.com/secrets-manager/pricing/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least loadFromSecretsManager is trivial but the loadFromS3 function is currently awful.
Theres a tonne of env vars as well backing the S3 solution.

MTLS_TEST_CERT_S3_KEY,
);
// eslint-disable-next-line sonarjs/null-dereference -- loadS3Object always returns a string
const parts = pem.split(/(?=-----BEGIN )/);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use secrets manager for everything we won't have this problem.

Copy link
Copy Markdown
Contributor

@mjewildnhs mjewildnhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the http-lambda (minus unit tests)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this is similar to the transform filter lambdas config-loader (get from s3, parse, validate?, set in the cache, return).
Maybe we should replace @nhs-notify-client-callbacks/config-cache with @nhs-notify-client-callbacks/config-subscription-cache?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite a bit different to the spec implementation.
I'm going to ignore for now while rate limiting not decided but the spec implementation I think is better with:

  • TTL
  • batched HMGET read
  • structured return
  • better commenting
  • bunch of other minor algorithm things

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the admit script (1 to come back to - https://github.com/NHSDigital/nhs-notify-client-callbacks/pull/145/changes#r3093774270) but I think this is worse than the spec impl.

admitSha = computeSha1(admitLuaSrc);
}

try {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor this into a evalScript function as both admit and recordResult do similar things (EVALSHA, NOSCRIPT error handling)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a TTL for the cache.
Not sure why it hasn't just copied the file from the transform filter lambda.

});
}

logger.info("Mock webhook invoked", {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log out whether the cert was present so we can assert mTLS vs non mTLS clients are behaving correctly in the integration tests (by parsing the logging as we do for other stuff).

const pem = decodeURIComponent(certHeader);
const cert = new X509Certificate(pem);
const now = new Date();
if (now < new Date(cert.validFrom) || now > new Date(cert.validTo)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually care about this? Its not as though we are checking the cert is even trusted/revoked/etc?
Why check expiry?

headerName: string;
headerValue: string;
};
mtls: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure is a bit weird here - aren't mtls and certPinning delivery concerns too?
The impl matches the spec but i think we should go back and change it to:

delivery?: {
    maxRetryDurationSeconds?: number; // per-target retry window; default 7200 (2 h)
    circuitBreaker?: {
      enabled: boolean;               // default true
    };
    mtls?: {
      enabled: boolean;               // present client cert; default true
      certPinning?: {
        enabled: boolean;             // pin server cert by SPKI hash; default true
        spkiHash?: string;            // base64 SHA-256 of server cert SPKI
        // absent when pinning disabled or hash not yet configured
      };
    };
  };

certPinning: certPinningSchema,
delivery: z
.object({
maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(),
maxRetryDurationSeconds: z.number().min(60).max(43_200).optional(), // 12 hours

warnings.push("mTLS is enabled but certificate pinning is disabled");
}

if (certPinning.enabled && !certPinning.spkiHash) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has come from the spec but the schema validation in the models package will fail in the http-lambda if pinning is on and there is no hash.
I think this should be an error and the user has to set the hash 1st then turn on pinning.

...commonOptions,
...clientIdOption,
...writeOptions,
"target-id": {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This target-id block is duplicated 3 times.
We should add targetIdOption to tools/client-subscriptions-management/src/entrypoint/cli/helper.ts like we do for the other options and also consume it in target-del which also has it (albeit with slightly different wording that can change).

Comment on lines +67 to +77
if (!config) {
throw new Error(`No configuration found for client: ${argv["client-id"]}`);
}

const target = config.targets.find((t) => t.targetId === argv["target-id"]);

if (!target) {
throw new Error(
`Target '${argv["target-id"]}' not found for client '${argv["client-id"]}'`,
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got this code pattern repeated on a lot of the targets (get client, throw if no client, get target, throw if no target).
We need the full config so can't refactor all of it away and in some places we expect no config/target.
Could we at least have requireClientConfig, requireTargetConfig.
Its minor but theres a lot of code in this tool so every bit helps

description: "Enable mTLS for this target",
conflicts: "disable",
},
disable: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In yargs for booleans you can do --enable and --no-enable which we could make use of (https://yargs.js.org/docs/#api-reference-booleankey) which I think would simplify things (the conflict check and enabled declaration at L54)

demandOption: true,
description: "Target identifier to update",
},
enable: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my other comment - using yargs built in --no-enable for this would simplify things.

}

// Safe as this is an internal tool and this CLI option we are expecting the user will run locally and manually
// eslint-disable-next-line security/detect-non-literal-fs-filename
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// eslint-disable-next-line security/detect-non-literal-fs-filename
// eslint-disable-next-line security/detect-non-literal-fs-filename -- path is provided directly by the operator via CLI arg

Not sure why the comment was removed but we've added a similar disclaimer in target-set-certificate.

Comment on lines +6 to +15
export type DeliveryResult =
| { ok: true }
| { ok: false; permanent: true }
| {
ok: false;
permanent: false;
statusCode: 429;
retryAfterHeader: string | undefined;
}
| { ok: false; permanent: false; statusCode: number };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export type DeliveryResult =
| { ok: true }
| { ok: false; permanent: true }
| {
ok: false;
permanent: false;
statusCode: 429;
retryAfterHeader: string | undefined;
}
| { ok: false; permanent: false; statusCode: number };
export type DeliveryResult =
| { outcome: "success" }
| { outcome: "permanent_failure" }
| { outcome: "rate_limited"; retryAfterHeader: string | undefined }
| { outcome: "transient_failure"; statusCode: number };

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks very promising - but where are the tests?!
Happy to leave them out till we get something reasonable looking deployed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants